NYC Housing Sale Price Visualization

Author: Hao-Li Huang
Date Created: Nov 29, 2020

1. Load data

The NYC neighborhood map, as a GeoJSON file, is taken from NYC Open Data:
https://data.cityofnewyork.us/City-Government/Neighborhood-Tabulation-Areas-NTA-/cpf4-rkhq

Property sales data (Nov 2019 - Oct 2020) is taken from the NYC Department of Finance:
https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page

The mapping from the borough-block-lot (bbl) number to the 2010 census tract is taken from the NYC PLUTO data downloader:
https://chriswhong.github.io/plutoplus

Finally, the mapping from the 2010 census tract to the 2010 Neighborhood Tabulation Area (NTA) is taken from the NYC Department of City Planning:
https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-nynta.page

2. Data Cleaning

2.1 df: Property sales data

2.1.1 Sale price

Let's first clean up df. Some sale prices are extremely low, which indicates a transfer of ownership with little or no cash consideration. These transfers should not be included in this analysis.
Here, I keep only the rows with a sale price > 100,000.
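
A minimal sketch of this filter on a toy table. The column name 'SALE PRICE' and the string-with-commas price format are assumptions about the raw file, not something confirmed above.

```python
import pandas as pd

# Toy stand-in for the sales table; 'SALE PRICE' is an assumed column name.
df = pd.DataFrame({
    "ADDRESS": ["A", "B", "C", "D"],
    "SALE PRICE": ["0", "10", "450,000", "1,200,000"],
})

# Prices may arrive as strings with thousands separators; coerce to numbers.
df["SALE PRICE"] = pd.to_numeric(
    df["SALE PRICE"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)

# Keep only sales strictly above the 100,000 threshold.
df_clean = df[df["SALE PRICE"] > 100_000].reset_index(drop=True)
print(df_clean)
```

Rows whose price cannot be parsed become NaN and are dropped by the comparison as well.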

Next, I noticed that some sale prices are duplicated, typically on the same date, in the same neighborhood, and in the same or a nearby block. It looks like multiple properties were traded in a single transaction, and the sale price listed for each property is likely the total price of the transaction rather than the price of that individual property.

The number of properties traded together varies. Large bundles are more likely to be bought by investors than by regular home-buyers. To keep this analysis relevant to regular home-buyers, I drop any row whose sale price is duplicated more than three times.

Now each sale price appears at most three times. However, it still reflects the total price rather than the individual prices. Next, I divide the total price by the number of properties traded together to estimate the individual price.
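
The duplicate handling can be sketched as follows. The grouping keys (neighborhood, sale date, sale price) and all column names are assumptions chosen for illustration, and "at most three duplicates" is read here as "at most three rows sharing the same keys".

```python
import pandas as pd

# Toy data: a pair, a bundle of four, and a single sale.
df = pd.DataFrame({
    "NEIGHBORHOOD": ["X"] * 2 + ["Y"] * 4 + ["Z"],
    "SALE DATE": ["2020-01-05"] * 2 + ["2020-02-10"] * 4 + ["2020-03-01"],
    "SALE PRICE": [900_000.0] * 2 + [4_000_000.0] * 4 + [650_000.0],
})

# Assumed definition of "sold together": same neighborhood, date, and price.
keys = ["NEIGHBORHOOD", "SALE DATE", "SALE PRICE"]

# Number of rows sharing the same keys, broadcast back to every row.
df["N_PROPERTIES"] = df.groupby(keys)["SALE PRICE"].transform("size")

# Drop bundles of more than three properties.
df = df[df["N_PROPERTIES"] <= 3].copy()

# Split the total price evenly across the properties in each bundle.
df["UNIT PRICE"] = df["SALE PRICE"] / df["N_PROPERTIES"]
print(df)
```

An even split is only an approximation, since the properties in a bundle need not be worth the same amount.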

2.1.2 Tax class

Next, I select only tax classes 1 and 2, which correspond to residential properties.
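
A small sketch of the tax-class filter. The column name 'TAX CLASS AT TIME OF SALE' and its integer coding are assumptions; if the column holds strings such as '2A', it would need to be normalized first.

```python
import pandas as pd

# Toy table; the tax-class column name is assumed.
df = pd.DataFrame({
    "ADDRESS": ["A", "B", "C", "D"],
    "TAX CLASS AT TIME OF SALE": [1, 2, 4, 2],
})

# Keep residential tax classes 1 and 2 only.
residential = df[df["TAX CLASS AT TIME OF SALE"].isin([1, 2])].reset_index(drop=True)
print(residential)
```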

2.1.3 bbl

The next task is to map the properties to the geojson map. The key used in the geojson file here is the NYC NTA (Neighborhood Tabulation Area) code, or NTA name.

To find out which NTA each property corresponds to, I chose to use the bbl (borough-block-lot) number to find the corresponding 2010 census tract, and then find the corresponding NTA.

An NYC bbl number uniquely identifies the location of a property, so the mapping method outlined above should find the NTA of each property. However, I noticed that some lot numbers have been restructured over the years; for example, a two-digit lot number may be split into multiple four-digit lot numbers. The removal and addition of lot numbers makes it tricky to map using the whole bbl.

Therefore, instead of using the whole bbl, I use only the borough-block number, because this analysis does not need lot-level precision.
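
One way to build such a borough-block key, sketched under assumptions: the sales table is taken to have separate 'BOROUGH' and 'BLOCK' columns, and the column name 'BORO_BLOCK' is made up for this example. A full bbl is 10 digits (1 borough digit, 5 block digits, 4 lot digits), so the same key can also be cut from a full bbl by dropping the last four digits.

```python
import pandas as pd

# Toy sales rows; note the last two differ only in lot number.
df = pd.DataFrame({
    "BOROUGH": [1, 3, 3],
    "BLOCK": [829, 1502, 1502],
    "LOT": [16, 7, 1001],
})

# Borough digit followed by the block, zero-padded to five digits.
df["BORO_BLOCK"] = df["BOROUGH"].astype(str) + df["BLOCK"].astype(str).str.zfill(5)

# From a full 10-digit bbl, integer division by 10**4 removes the lot.
full_bbl = 3015020007
boro_block_from_bbl = str(full_bbl // 10**4)
print(df["BORO_BLOCK"].tolist(), boro_block_from_bbl)
```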

2.2 pluto: bbl to 2010 census tract

2.2.1 bbl

2.2.2 2010 census tract

The format of the 2010 census tract differs between pluto and census. In census, it is always 6 digits, whereas in pluto it is 2 to 4 digits, sometimes followed by a decimal part.

Here's the rule of conversion: multiply the pluto tract number by 100 and pad the result with leading zeros to 6 digits.

For example, 838 becomes 083800, 1502 becomes 150200, 798.02 becomes 079802, and so on.
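
The examples above are consistent with multiplying the pluto value by 100 and zero-padding to six digits. Here is a minimal sketch of that reading; the function name is hypothetical, and Decimal is used so that values like 798.02 are not disturbed by floating-point rounding.

```python
from decimal import Decimal

def tract_to_six_digits(tract) -> str:
    """Convert a pluto-style tract (e.g. 838 or 798.02) to a 6-digit string."""
    # Going through str() avoids binary floating-point error in 798.02 * 100.
    hundredths = int(Decimal(str(tract)) * 100)
    return f"{hundredths:06d}"

for value in (838, 1502, 798.02):
    print(value, "->", tract_to_six_digits(value))
```

On a pandas column, the same function could be applied with `Series.map`.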

2.2.3 Identifier

Although the census tract number is unique within a borough, it is not unique throughout NYC. I therefore create a column named 'Identifier' with the format (borough code, census tract), which is unique in NYC.
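
A sketch of one possible 'Identifier' encoding: the borough code concatenated with the 6-digit tract as a single string. The concatenated-string form and the column names 'BoroCode' and 'Tract' are assumptions; a tuple would serve equally well.

```python
import pandas as pd

# Two boroughs sharing the same tract number (toy values).
pluto = pd.DataFrame({
    "BoroCode": [1, 3],
    "Tract": ["015200", "015200"],
})

# The tract alone is ambiguous; prefixing the borough code removes the clash.
pluto["Identifier"] = pluto["BoroCode"].astype(str) + pluto["Tract"]
print(pluto)
```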

2.3 census: 2010 census tract to NTA

The remaining task is simple: I just need to add a column named 'Identifier' to census, built the same way as in 2.2.3.
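
With 'Identifier' present in both tables, the lookup chain from 2.1.3 (borough-block to census tract to NTA) can be sketched as two joins. Everything below is illustrative: the tables are hypothetical miniatures, the column names 'BORO_BLOCK', 'BoroCode', 'CT2010', and 'NTACode' are assumptions, and the NTA codes are placeholders.

```python
import pandas as pd

# Hypothetical miniature tables; all names and values are placeholders.
sales = pd.DataFrame({"BORO_BLOCK": ["100829", "301502"]})
pluto = pd.DataFrame({
    "BORO_BLOCK": ["100829", "100829", "301502"],
    "Identifier": ["1005600", "1005600", "3015200"],
})
census = pd.DataFrame({
    "BoroCode": [1, 3],
    "CT2010": ["005600", "015200"],
    "NTACode": ["NTA_A", "NTA_B"],
})

# Same construction as for pluto: borough code + 6-digit tract.
census["Identifier"] = census["BoroCode"].astype(str) + census["CT2010"]

# Keep one row per borough-block so the join cannot multiply sales rows.
block_to_tract = pluto.drop_duplicates("BORO_BLOCK")

merged = (
    sales.merge(block_to_tract, on="BORO_BLOCK", how="left")
         .merge(census[["Identifier", "NTACode"]], on="Identifier", how="left")
)
print(merged)
```

Keeping only the first tract per block is a simplification: a block that straddles two tracts would be assigned to just one of them.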

3. Choropleth Map